Robotics 41
☆ Neural Circuit Architectural Priors for Quadruped Locomotion
Learning-based approaches to quadruped locomotion commonly adopt generic
policy architectures like fully connected MLPs. As such architectures contain
few inductive biases, it is common in practice to incorporate priors in the
form of rewards, training curricula, imitation data, or trajectory generators.
In nature, animals are born with priors in the form of their nervous system's
architecture, which has been shaped by evolution to confer innate ability and
efficient learning. For instance, a horse can walk within hours of birth and
can quickly improve with practice. Such architectural priors can also be useful
in ANN architectures for AI. In this work, we explore the advantages of a
biologically inspired ANN architecture for quadruped locomotion based on neural
circuits in the limbs and spinal cord of mammals. Our architecture achieves
good initial performance and comparable final performance to MLPs, while using
less data and orders of magnitude fewer parameters. Our architecture also
exhibits better generalization to task variations, even admitting deployment on
a physical robot without standard sim-to-real methods. This work shows that
neural circuits can provide valuable architectural priors for locomotion and
encourages future work in other sensorimotor skills.
☆ VIRT: Vision Instructed Transformer for Robotic Manipulation
Zhuoling Li, Liangliang Ren, Jinrong Yang, Yong Zhao, Xiaoyang Wu, Zhenhua Xu, Xiang Bai, Hengshuang Zhao
Robotic manipulation, owing to its multi-modal nature, often faces
significant training ambiguity, necessitating explicit instructions to clearly
delineate the manipulation details in tasks. In this work, we highlight that
vision instruction is naturally more comprehensible to recent robotic policies
than the commonly adopted text instruction, as these policies are born with
some vision understanding ability like human infants. Building on this premise
and drawing inspiration from cognitive science, we introduce the robotic
imagery paradigm, which realizes large-scale robotic data pre-training without
text annotations. Additionally, we propose the robotic gaze strategy that
emulates the human eye gaze mechanism, thereby guiding subsequent actions and
focusing the attention of the policy on the manipulated object. Leveraging
these innovations, we develop VIRT, a fully Transformer-based policy. We design
comprehensive tasks using both a physical robot and simulated environments to
assess the efficacy of VIRT. The results indicate that VIRT can complete
highly challenging tasks such as opening the lid of a tightly sealed bottle,
and the proposed techniques boost the success rates of the baseline policy on
diverse challenging tasks from nearly 0% to more than 65%.
☆ Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making NeurIPS 2024
Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Li Erran Li, Ruohan Zhang, Weiyu Liu, Percy Liang, Li Fei-Fei, Jiayuan Mao, Jiajun Wu
We aim to evaluate Large Language Models (LLMs) for embodied decision making.
While a significant body of work has been leveraging LLMs for decision making
in embodied environments, we still lack a systematic understanding of their
performance because they are usually applied in different domains, for
different purposes, and built based on different inputs and outputs.
Furthermore, existing evaluations tend to rely solely on a final success rate,
making it difficult to pinpoint what ability is missing in LLMs and where the
problem lies, which in turn blocks embodied agents from leveraging LLMs
effectively and selectively. To address these limitations, we propose a
generalized interface (Embodied Agent Interface) that supports the
formalization of various types of tasks and input-output specifications of
LLM-based modules. Specifically, it allows us to unify 1) a broad set of
embodied decision-making tasks involving both state and temporally extended
goals, 2) four commonly-used LLM-based modules for decision making: goal
interpretation, subgoal decomposition, action sequencing, and transition
modeling, and 3) a collection of fine-grained metrics that break evaluation
down into various error types, such as hallucination errors, affordance
errors, and different kinds of planning errors. Overall, our
benchmark offers a comprehensive assessment of LLMs' performance for different
subtasks, pinpointing the strengths and weaknesses in LLM-powered embodied AI
systems, and providing insights for effective and selective use of LLMs in
embodied decision making.
comment: Accepted for oral presentation at NeurIPS 2024 in the Datasets and
Benchmarks track
☆ Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology ICLR 2025
Xiangyu Wang, Donglin Yang, Ziqin Wang, Hohin Kwan, Jinyu Chen, Wenjun Wu, Hongsheng Li, Yue Liao, Si Liu
Developing agents capable of navigating to a target location based on
language instructions and visual information, known as vision-language
navigation (VLN), has attracted widespread interest. Most research has focused
on ground-based agents, while UAV-based VLN remains relatively underexplored.
Recent efforts in UAV vision-language navigation predominantly adopt
ground-based VLN settings, relying on predefined discrete action spaces and
neglecting the inherent disparities in agent movement dynamics and the
complexity of navigation tasks between ground and aerial environments. To
address these disparities and challenges, we propose solutions from three
perspectives: platform, benchmark, and methodology. To enable realistic UAV
trajectory simulation in VLN tasks, we propose the OpenUAV platform, which
features diverse environments, realistic flight control, and extensive
algorithmic support. We further construct a target-oriented VLN dataset
consisting of approximately 12k trajectories on this platform, serving as the
first dataset specifically designed for realistic UAV VLN tasks. To tackle the
challenges posed by complex aerial environments, we propose an assistant-guided
UAV object search benchmark called UAV-Need-Help, which provides varying levels
of guidance information to help UAVs better accomplish realistic VLN tasks. We
also propose a UAV navigation LLM that, given multi-view images, task
descriptions, and assistant instructions, leverages the multimodal
understanding capabilities of the MLLM to jointly process visual and textual
information, and performs hierarchical trajectory generation. In evaluation,
our method significantly outperforms the baseline models, yet a considerable
gap remains between our results and those achieved by human operators,
underscoring the challenge posed by the UAV-Need-Help task.
comment: Under review as a conference paper at ICLR 2025
☆ FlowBotHD: History-Aware Diffuser Handling Ambiguities in Articulated Objects Manipulation
We introduce a novel approach to manipulating articulated objects under
ambiguity, such as opening a door whose opening side and direction are
uncertain. Multi-modality arises when the way to open a fully closed door
(push, pull, slide) is uncertain, or when the side from which it should be
opened is uncertain. Occlusions further obscure the door's shape from certain
angles, creating additional ambiguity. To tackle these challenges, we propose
a history-aware diffusion network that models the multi-modal distribution of
the articulated object and uses history to disambiguate actions and make
stable predictions under occlusions. Experiments and analysis demonstrate the
state-of-the-art performance of our method, with specific improvements in
ambiguity-caused failure modes. Our project website is available at
https://flowbothd.github.io/.
comment: Accepted to CoRL 2024
☆ RM4D: A Combined Reachability and Inverse Reachability Map for Common 6-/7-axis Robot Arms by Dimensionality Reduction to 4D ICRA 2025
Knowledge of a manipulator's workspace is fundamental for a variety of tasks
including robot design, grasp planning and robot base placement. Consequently,
workspace representations are well studied in robotics. Two important
representations are reachability maps and inverse reachability maps. The former
predicts whether a given end-effector pose is reachable from where the robot
currently is, and the latter suggests suitable base positions for a desired
end-effector pose. Typically, the reachability map is built by discretizing the
6D space containing the robot's workspace and determining, for each cell,
whether it is reachable or not. The reachability map is subsequently inverted
to build the inverse map. This is a cumbersome process which restricts the
applications of such maps. In this work, we exploit commonalities of existing
six and seven axis robot arms to reduce the dimension of the discretization
from 6D to 4D. We propose Reachability Map 4D (RM4D), a map that only requires
a single 4D data structure for both forward and inverse queries. This yields a
much more compact map that can be constructed an order of magnitude faster
than existing maps, with no inversion overhead and no loss of accuracy. Our
experiments showcase the usefulness of RM4D for grasp planning with a mobile
manipulator.
comment: Submitted to ICRA 2025. See project page:
https://mrudorfer.github.io/rm4d/
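The 6D-to-4D reduction at the heart of RM4D can be sketched concretely. The snippet below shows one plausible parametrization, assuming invariance to rotation about the base axis and to tool roll; the paper's actual choice of coordinates may differ, and all names here are illustrative:

```python
import numpy as np

def pose_to_key4d(pos, tool_axis, res_lin=0.05, res_ang=np.deg2rad(10)):
    """Reduce a 6D end-effector pose to a 4D grid key.

    Illustrative assumptions (not necessarily the paper's exact
    parametrization): reachability is invariant to (a) rotating the pose
    about the base z-axis and (b) rolling the tool about its own axis.
    What remains is 4D: radial distance r, height z, the tool-axis
    inclination theta, and the tool-axis azimuth phi measured relative
    to the radial direction.
    """
    x, y, z = pos
    r = np.hypot(x, y)                         # distance from the base axis
    az_pos = np.arctan2(y, x)                  # position azimuth (dropped)
    ax, ay, az = np.asarray(tool_axis) / np.linalg.norm(tool_axis)
    theta = np.arccos(np.clip(az, -1.0, 1.0))  # tool-axis inclination
    phi = np.arctan2(ay, ax) - az_pos          # azimuth w.r.t. radial dir
    phi = (phi + np.pi) % (2 * np.pi) - np.pi  # wrap to [-pi, pi)
    return (round(r / res_lin), round(z / res_lin),
            round(theta / res_ang), round(phi / res_ang))
```

A useful sanity check is that rotating both the position and the tool axis about the base z-axis leaves the key unchanged, which is exactly the symmetry such a reduction exploits.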
☆ Control System Design and Experiments for Autonomous Underwater Helicopter Docking Procedure Based on Acoustic-inertial-optical Guidance
A control system structure for the underwater docking procedure of an
Autonomous Underwater Helicopter (AUH) is proposed in this paper, which
utilizes acoustic-inertial-optical guidance. Unlike conventional Autonomous
Underwater Vehicles (AUVs), an AUH faces more stringent maneuverability
requirements during the docking procedure: it must remain stationary or
exhibit minimal horizontal movement while moving vertically. The docking
procedure is divided into two stages, Homing and Landing, each using a
different guidance method. Additionally, a segmented aligning strategy
operating at various altitudes and a linear velocity decision are both adopted
in the Landing stage. Due to the unique structure of the Subsea Docking System (SDS), the AUH
is required to dock onto the SDS in a fixed orientation with specific attitude
and altitude. Therefore, a particular criterion is proposed to determine
whether the AUH has successfully docked onto the SDS. Furthermore, the
effectiveness and robustness of the proposed control method in AUH's docking
procedure are demonstrated through pool experiments and sea trials.
☆ Combining Planning and Diffusion for Mobility with Unknown Dynamics ICRA 2025
Manipulation of large objects over long horizons (such as carts in a
warehouse) is an essential skill for deployable robotic systems. Large objects
require mobile manipulation which involves simultaneous manipulation,
navigation, and movement with the object in tow. In many real-world situations,
object dynamics are incredibly complex, such as the interaction of an office
chair (with a rotating base and five caster wheels) and the ground. We present
a hierarchical algorithm for long-horizon robot manipulation problems in which
the dynamics are partially unknown. We observe that diffusion-based behavior
cloning is highly effective for short-horizon problems with unknown dynamics,
so we decompose the problem into an abstract high-level, obstacle-aware
motion-planning problem that produces a waypoint sequence. We use a
short-horizon, relative-motion diffusion policy to achieve the waypoints in
sequence. We train mobile manipulation policies on a Spot robot that has to
push and pull an office chair. Our hierarchical manipulation policy performs
consistently better, especially when the horizon increases, compared to a
diffusion policy trained on long-horizon demonstrations or motion planning
assuming a rigidly-attached object, succeeding in 8 of 10 runs versus 0 and 5,
respectively. Importantly, our learned policy generalizes to
new layouts, grasps, chairs, and flooring that induces more friction, without
any further training, showing promise for other complex mobile manipulation
problems. Project Page: https://yravan.github.io/plannerorderedpolicy/
comment: Submitted to ICRA 2025
☆ Safe Reinforcement Learning Filter for Multicopter Collision-Free Tracking under disturbances
This paper proposes a safe reinforcement learning filter (SRLF) to realize
multicopter collision-free trajectory tracking under input disturbance. A
novel robust control barrier function (RCBF), together with its analysis
techniques, is introduced to avoid collisions under unknown disturbances
during tracking. To ensure the system state remains within the safe set, the
RCBF gain is incorporated into the control action. A safety filter is
introduced to transform unsafe reinforcement learning (RL) control inputs into
safe ones, allowing RL training to proceed without explicitly considering
safety constraints. The SRLF obtains
rigorous guaranteed safe control action by solving a quadratic programming (QP)
problem that incorporates forward invariance of RCBF and input saturation
constraints. Both simulation and real-world experiments on multicopters
demonstrate the effectiveness and excellent performance of SRLF in achieving
collision-free tracking under input disturbances and saturation.
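The safe-action QP at the core of such safety filters has a simple closed form in the single-constraint case (input saturation omitted). A minimal sketch, where the constraint vector `a` and bound `b` stand in for the RCBF forward-invariance condition derived in the paper:

```python
import numpy as np

def safety_filter(u_rl, a, b):
    """Project an RL action onto {u : a.u >= b} (minimum-norm correction).

    Solves min ||u - u_rl||^2  s.t.  a.u >= b, the single-constraint
    special case of the CBF quadratic program; a and b would come from a
    forward-invariance condition of the form h_dot + alpha(h) >= 0.
    """
    a = np.asarray(a, float)
    u_rl = np.asarray(u_rl, float)
    slack = a @ u_rl - b
    if slack >= 0:      # RL action already safe: pass it through unchanged
        return u_rl
    # otherwise shift minimally along a to reach the constraint boundary
    return u_rl + (-slack / (a @ a)) * a
```

For example, `safety_filter([1.0, 0.0], a=[0.0, 1.0], b=0.5)` returns `[1.0, 0.5]`: the input along the unsafe axis is raised just enough to meet the barrier condition, while the safe component is left untouched.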
☆ A Safety Modulator Actor-Critic Method in Model-Free Safe Reinforcement Learning and Application in UAV Hovering
This paper proposes a safety modulator actor-critic (SMAC) method to address
safety constraint and overestimation mitigation in model-free safe
reinforcement learning (RL). A safety modulator is developed to satisfy safety
constraints by modulating actions, allowing the policy to ignore the safety
constraint and focus on maximizing reward. Additionally, a distributional
critic with a theoretical update rule for SMAC is proposed to mitigate the
overestimation of Q-values under safety constraints. Both simulation and
real-world experiments on Unmanned Aerial Vehicle (UAV) hovering
confirm that the SMAC can effectively maintain safety constraints and
outperform mainstream baseline algorithms.
☆ Dynamic Neural Potential Field: Online Trajectory Optimization in Presence of Moving Obstacles
We address the task of local trajectory planning for a mobile robot in the
presence of static and dynamic obstacles. The local trajectory is obtained as
a numerical solution of a Model Predictive Control (MPC) problem. Collision
avoidance can be provided by adding the repulsive potential of the obstacles
to the cost function of the MPC. We develop an approach in which the repulsive
potential is estimated by a neural model. We propose and explore three
possible strategies for handling dynamic obstacles. First, the environment
with dynamic obstacles is treated as a sequence of static environments.
Second, the neural model predicts a sequence of repulsive potentials at once.
Third, the neural model predicts future repulsive potentials step by step in
autoregressive mode. We implement these strategies and compare them with CIAO*
and MPPI using the BenchMR framework. The first two strategies show higher
performance than CIAO* and MPPI while preserving safety constraints. The third
strategy is somewhat slower but still satisfies the time limits. We deploy our
approach on a Husky UGV mobile platform, which moves through office corridors
under the proposed MPC local trajectory planner. The code and trained models are available at
\url{https://github.com/CognitiveAISystems/Dynamic-Neural-Potential-Field}.
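The MPC-with-repulsive-potential idea can be sketched in a few lines. Below, a hand-written Gaussian bump stands in for the neural potential model, the obstacle's future positions are predicted up front (in the spirit of the second strategy), and the optimization is plain gradient descent rather than a real MPC solver; all names and parameters are illustrative:

```python
import numpy as np

def repulsive_potential(p, obs, radius=0.6, gain=2.0):
    """Stand-in for the neural repulsive model: a smooth bump around an
    obstacle position (the real system predicts this with a network)."""
    return gain * np.exp(-np.sum((p - obs) ** 2) / radius ** 2)

def plan_cost(traj, goal, obs_traj):
    """MPC-style cost: terminal tracking + smoothness + predicted
    repulsion summed along the horizon."""
    track = np.sum((traj[-1] - goal) ** 2) + np.sum(np.diff(traj, axis=0) ** 2)
    rep = sum(repulsive_potential(p, o) for p, o in zip(traj, obs_traj))
    return track + rep

def plan(start, goal, obs_traj, H=20, iters=100, lr=0.05):
    """Refine a straight-line initial trajectory by gradient descent on
    plan_cost, using central finite differences (a stand-in for the MPC
    solver); the start point is held fixed."""
    traj = np.linspace(start, goal, H)
    for _ in range(iters):
        g = np.zeros_like(traj)
        for k in range(1, H):
            for d in range(traj.shape[1]):
                e = np.zeros_like(traj)
                e[k, d] = 1e-4
                g[k, d] = (plan_cost(traj + e, goal, obs_traj)
                           - plan_cost(traj - e, goal, obs_traj)) / 2e-4
        traj = traj - lr * g
    return traj
```

With a static obstacle slightly off the straight line between start and goal, the optimized trajectory trades smoothness against the repulsive term and its cost drops below that of the straight-line initialization.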
☆ Discrete time model predictive control for humanoid walking with step adjustment
This paper presents a Discrete-Time Model Predictive Controller (MPC) for
humanoid walking with online footstep adjustment. The proposed controller
utilizes a hierarchical control approach. The high-level controller uses a
low-dimensional Linear Inverted Pendulum Model (LIPM) to determine desired foot
placement and Center of Mass (CoM) motion, to prevent falls while maintaining
the desired velocity. A Task Space Controller (TSC) then tracks the desired
motion obtained from the high-level controller, exploiting the whole-body
dynamics of the humanoid. Our approach differs from existing MPC methods for
walking pattern generation by not relying on a predefined foot-plan or a
reference center of pressure (CoP) trajectory. The overall approach is tested
in simulation on a torque-controlled humanoid robot. Results show that the
proposed control approach generates stable walking and prevents falls under
push disturbances.
comment: 6 pages, 17 figures, 1 table
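The LIPM used by such high-level controllers has an exact discrete-time solution, which is what makes it attractive for MPC. A minimal sketch of one prediction step, using standard LIPM notation rather than symbols from the paper:

```python
import numpy as np

def lipm_step(x, p, T, z_c=0.8, g=9.81):
    """Exact one-step update of the Linear Inverted Pendulum Model.

    State x = [com_position, com_velocity] along one horizontal axis,
    p = current foot (CoP) position. Continuous dynamics
    com_ddot = omega^2 * (com - p) with omega = sqrt(g / z_c),
    discretized exactly over the sampling period T.
    """
    w = np.sqrt(g / z_c)
    ch, sh = np.cosh(w * T), np.sinh(w * T)
    A = np.array([[ch, sh / w],
                  [w * sh, ch]])       # state transition over T
    B = np.array([1.0 - ch, -w * sh])  # effect of the CoP position
    return A @ x + B * p
```

Two sanity checks follow directly from the model: a CoM resting exactly over the foot stays put, while a CoM offset from the foot diverges, which is why the MPC must adjust foot placement to keep the state capturable.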
☆ Collective perception for tracking people with a robot swarm ICRA
Miquel Kegeleirs, David Garzón Ramos, Guillermo Legarda Herranz, Ilyes Gharbi, Jeanne Szpirer, Olivier Debeir, Ken Hasselmann, Lorenzo Garattoni, Gianpiero Francesca, Mauro Birattari
Swarm perception refers to the ability of a robot swarm to utilize the
perception capabilities of each individual robot, forming a collective
understanding of the environment. Their distributed nature enables robot swarms
to continuously monitor dynamic environments by maintaining a constant presence
throughout the space. In this study, we present a preliminary experiment on the
collective tracking of people using a robot swarm. The experiment was conducted
in simulation across four different office environments, with swarms of varying
sizes. The robots were provided with images sampled from a dataset of
real-world office environment pictures. We measured the time distribution
required for a robot to detect a person changing location and to propagate this
information to increasing fractions of the swarm. The results indicate that
robot swarms show significant promise in monitoring dynamic environments.
comment: Presented at ICRA@40, Rotterdam
☆ OmniPose6D: Towards Short-Term Object Pose Tracking in Dynamic Scenes from Monocular RGB
Yunzhi Lin, Yipu Zhao, Fu-Jen Chu, Xingyu Chen, Weiyao Wang, Hao Tang, Patricio A. Vela, Matt Feiszli, Kevin Liang
To address the challenge of short-term object pose tracking in dynamic
environments with monocular RGB input, we introduce a large-scale synthetic
dataset OmniPose6D, crafted to mirror the diversity of real-world conditions.
We additionally present a benchmarking framework for a comprehensive comparison
of pose tracking algorithms. We propose a pipeline featuring an
uncertainty-aware keypoint refinement network, employing probabilistic modeling
to refine pose estimation. Comparative evaluations demonstrate that our
approach achieves performance superior to existing baselines on real datasets,
underscoring the effectiveness of our synthetic dataset and refinement
technique in enhancing tracking precision in dynamic contexts. Our
contributions set a new precedent for the development and assessment of object
pose tracking methodologies in complex scenes.
comment: 13 pages, 9 figures
☆ Autonomous localization of multiple ionizing radiation sources using miniature single-layer Compton cameras onboard a group of micro aerial vehicles IROS
Michal Werner, Tomáš Báča, Petr Štibinger, Daniela Doubravová, Jaroslav Šolc, Jan Rusňák, Martin Saska
A novel method for autonomous localization of multiple sources of gamma
radiation using a group of Micro Aerial Vehicles (MAVs) is presented in this
paper. The method utilizes an extremely lightweight (44 g) Compton camera
MiniPIX TPX3. The compact size of the detector allows for deployment onboard
safe and agile small-scale Unmanned Aerial Vehicles (UAVs). The proposed
radiation mapping approach fuses measurements from multiple distributed Compton
camera sensors to accurately estimate the positions of multiple radioactive
sources in real time. Unlike commonly used intensity-based detectors, the
Compton camera reconstructs the set of possible directions towards a radiation
source from just a single ionizing particle. Therefore, the proposed approach
can localize radiation sources without having to estimate the gradient of a
radiation field or contour lines, which require longer measurements. The
instant estimation is able to fully exploit the potential of highly mobile
MAVs. The radiation mapping method is combined with an active search strategy,
which coordinates the future actions of the MAVs in order to improve the
quality of the estimate of the sources' positions, as well as to explore the
area of interest faster. The proposed solution is evaluated in simulation and
real-world experiments with multiple Cesium-137 radiation sources.
comment: International Conference on Intelligent Robots and Systems (IROS)
2024
☆ M${}^{3}$Bench: Benchmarking Whole-body Motion Generation for Mobile Manipulation in 3D Scenes
We propose M^3Bench, a new benchmark for whole-body motion generation for
mobile manipulation tasks. Given a 3D scene context, M^3Bench requires an
embodied agent to understand its configuration, environmental constraints and
task objectives, then generate coordinated whole-body motion trajectories for
object rearrangement tasks. M^3Bench features 30k object rearrangement tasks
across 119 diverse scenes, providing expert demonstrations generated by our
newly developed M^3BenchMaker. This automatic data generation tool produces
coordinated whole-body motion trajectories from high-level task instructions,
requiring only basic scene and robot information. Our benchmark incorporates
various task splits to assess generalization across different dimensions and
leverages realistic physics simulation for trajectory evaluation. Through
extensive experimental analyses, we reveal that state-of-the-art models still
struggle with coordinated base-arm motion while adhering to environment-context
and task-specific constraints, highlighting the need to develop new models that
address this gap. Through M^3Bench, we aim to facilitate future robotics
research towards more adaptive and capable mobile manipulation in diverse,
real-world environments.
☆ Task Coordination and Trajectory Optimization for Multi-Aerial Systems via Signal Temporal Logic: A Wind Turbine Inspection Study IROS'24
This paper presents a method for task allocation and trajectory generation in
cooperative inspection missions using a fleet of multirotor drones, with a
focus on wind turbine inspection. The approach generates safe, feasible flight
paths that adhere to time-sensitive constraints and vehicle limitations by
formulating an optimization problem based on Signal Temporal Logic (STL)
specifications. An event-triggered replanning mechanism addresses unexpected
events and delays, while a generalized robustness scoring method incorporates
user preferences and minimizes task conflicts. The approach is validated
through simulations in MATLAB and Gazebo, as well as field experiments in a
mock-up scenario.
comment: 2 pages, Accepted for discussion at the workshop session "Formal
methods techniques in robotics systems: Design and control" at IROS'24 in Abu
Dhabi, UAE
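STL-based formulations like this one score candidate trajectories by robustness: a signed margin that is positive when the formula is satisfied, with min/max standing in for always/eventually. A minimal sketch of two common cases (the paper's generalized robustness scoring refines this baseline, and the predicates here are illustrative):

```python
def robustness_visit(traj, target, eps, deadline):
    """Robustness of 'eventually_[0,deadline] (|x - target| <= eps)':
    the best margin achieved at any step up to the deadline."""
    return max(eps - abs(x - target) for x in traj[:deadline + 1])

def robustness_stay(traj, lo, hi):
    """Robustness of 'always (lo <= x <= hi)': the worst signed distance
    to the corridor boundary over the whole trajectory."""
    return min(min(x - lo, hi - x) for x in traj)
```

An optimizer maximizing these scores prefers trajectories that satisfy the mission with margin, which is what makes robustness a natural objective for time-constrained inspection tasks.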
☆ Pair-VPR: Place-Aware Pre-training and Contrastive Pair Classification for Visual Place Recognition with Vision Transformers
In this work we propose a novel joint training method for Visual Place
Recognition (VPR), which simultaneously learns a global descriptor and a pair
classifier for re-ranking. The pair classifier can predict whether a given pair
of images are from the same place or not. The network only comprises Vision
Transformer components for both the encoder and the pair classifier, and both
components are trained using their respective class tokens. In existing VPR
methods, typically the network is initialized using pre-trained weights from a
generic image dataset such as ImageNet. In this work we propose an alternative
pre-training strategy, by using Siamese Masked Image Modelling as a
pre-training task. We propose a Place-aware image sampling procedure from a
collection of large VPR datasets for pre-training our model, to learn visual
features tuned specifically for VPR. By re-using the Masked Image Modelling
encoder and decoder weights in the second stage of training, Pair-VPR can
achieve state-of-the-art VPR performance across five benchmark datasets with a
ViT-B encoder, along with further improvements in localization recall with
larger encoders. The Pair-VPR website is:
https://csiro-robotics.github.io/Pair-VPR.
☆ ES-Gaussian: Gaussian Splatting Mapping via Error Space-Based Gaussian Completion
Accurate and affordable indoor 3D reconstruction is critical for effective
robot navigation and interaction. Traditional LiDAR-based mapping provides high
precision but is costly, heavy, and power-intensive, with limited ability for
novel view rendering. Vision-based mapping, while cost-effective and capable of
capturing visual data, often struggles with high-quality 3D reconstruction due
to sparse point clouds. We propose ES-Gaussian, an end-to-end system using a
low-altitude camera and single-line LiDAR for high-quality 3D indoor
reconstruction. Our system features Visual Error Construction (VEC) to enhance
sparse point clouds by identifying and correcting areas with insufficient
geometric detail from 2D error maps. Additionally, we introduce a novel 3DGS
initialization method guided by single-line LiDAR, overcoming the limitations
of traditional multi-view setups and enabling effective reconstruction in
resource-constrained environments. Extensive experimental results on our new
Dreame-SR dataset and a publicly available dataset demonstrate that ES-Gaussian
outperforms existing methods, particularly in challenging scenarios. The
project page is available at https://chenlu-china.github.io/ES-Gaussian/.
comment: Project page: https://chenlu-china.github.io/ES-Gaussian/
☆ Disturbance Observer-based Control Barrier Functions with Residual Model Learning for Safe Reinforcement Learning
Reinforcement learning (RL) agents need to explore their environment to learn
optimal behaviors and achieve maximum rewards. However, exploration can be
risky when training RL directly on real systems, while simulation-based
training introduces the tricky issue of the sim-to-real gap. Recent approaches
have leveraged safety filters, such as control barrier functions (CBFs), to
penalize unsafe actions during RL training. However, the strong safety
guarantees of CBFs rely on a precise dynamic model. In practice, uncertainties
always exist, including internal disturbances from the errors of dynamics and
external disturbances such as wind. In this work, we propose a new safe RL
framework based on disturbance rejection-guarded learning, which allows for an
almost model-free RL with an assumed but not necessarily precise nominal
dynamic model. We demonstrate our results on the Safety-gym benchmark for
Point and Car robots on all tasks, where we outperform state-of-the-art
approaches that use only residual model learning or a disturbance observer (DOB). We
further validate the efficacy of our framework using a physical F1/10 racing
car. Videos: https://sites.google.com/view/res-dob-cbf-rl
☆ Agile Mobility with Rapid Online Adaptation via Meta-learning and Uncertainty-aware MPPI
Modern non-linear model-based controllers require an accurate physics model
and model parameters to be able to control mobile robots at their limits. Also,
due to surface slipping at high speeds, the friction parameters may continually
change (like tire degradation in autonomous racing), and the controller may
need to adapt rapidly. Many works derive a task-specific robot model with a
parameter adaptation scheme that works well for the task but requires a lot of
effort and tuning for each platform and task. In this work, we design a full
model-learning-based controller based on meta pre-training that can very
quickly adapt using few-shot dynamics data to any wheel-based robot with any
model parameters, while also reasoning about model uncertainty. We demonstrate
our results in small-scale numeric simulation, the large-scale Unity simulator,
and on a medium-scale hardware platform with a wide range of settings. We show
that our results are comparable to domain-specific well-engineered controllers,
and have excellent generalization performance across all scenarios.
☆ Real-to-Sim Grasp: Rethinking the Gap between Simulation and Real World in Grasp Detection
For 6-DoF grasp detection, simulated data can be scaled up to train more
powerful models, but it faces the challenge of a large gap between simulation
and the real world. Previous works bridge this gap in a sim-to-real way. However,
this way explicitly or implicitly forces the simulated data to adapt to the
noisy real data when training grasp detectors, where the positional drift and
structural distortion within the camera noise will harm the grasp learning. In
this work, we propose a Real-to-Sim framework for 6-DoF Grasp detection, named
R2SGrasp, with the key insight of bridging this gap in a real-to-sim way, which
directly bypasses the camera noise in grasp detector training through an
inference-time real-to-sim adaption. To achieve this real-to-sim adaptation,
our R2SGrasp designs the Real-to-Sim Data Repairer (R2SRepairer) to mitigate
the camera noise of real depth maps at the data level, and the Real-to-Sim
Feature Enhancer (R2SEnhancer) to enhance real features with precise simulated
geometric primitives at the feature level. To endow our framework with
generalization ability, we cost-efficiently construct a large-scale simulated
dataset to train our grasp detector, which includes 64,000 RGB-D
images with 14.4 million grasp annotations. Extensive experiments show that
R2SGrasp is powerful and our real-to-sim perspective is effective. The
real-world experiments further show great generalization ability of R2SGrasp.
Project page is available on https://isee-laboratory.github.io/R2SGrasp.
☆ QuadBEV: An Efficient Quadruple-Task Perception Framework via Bird's-Eye-View Representation
Bird's-Eye-View (BEV) perception has become a vital component of autonomous
driving systems due to its ability to integrate multiple sensor inputs into a
unified representation, enhancing performance in various downstream tasks.
However, the computational demands of BEV models pose challenges for real-world
deployment in vehicles with limited resources. To address these limitations, we
propose QuadBEV, an efficient multitask perception framework that leverages the
shared spatial and contextual information across four key tasks: 3D object
detection, lane detection, map segmentation, and occupancy prediction. QuadBEV
not only streamlines the integration of these tasks using a shared backbone and
task-specific heads but also addresses common multitask learning challenges
such as learning rate sensitivity and conflicting task objectives. Our
framework reduces redundant computations, thereby enhancing system efficiency,
making it particularly suited for embedded systems. We present comprehensive
experiments that validate the effectiveness and robustness of QuadBEV,
demonstrating its suitability for real-world applications.
☆ BiC-MPPI: Goal-Pursuing, Sampling-Based Bidirectional Rollout Clustering Path Integral for Trajectory Optimization
This paper introduces the Bidirectional Clustered MPPI (BiC-MPPI) algorithm,
a novel trajectory optimization method aimed at enhancing goal-directed
guidance within the Model Predictive Path Integral (MPPI) framework. BiC-MPPI
incorporates bidirectional dynamics approximations and a new guide cost
mechanism, improving both trajectory planning and goal-reaching performance. By
leveraging forward and backward rollouts, the bidirectional approach ensures
effective trajectory connections between initial and terminal states, while the
guide cost helps discover dynamically feasible paths. Experimental results
demonstrate that BiC-MPPI outperforms existing MPPI variants in both 2D and 3D
environments, achieving higher success rates and competitive computation times
across 900 simulations on a modified BARN dataset for autonomous navigation.
GitHub: https://github.com/i-ASL/BiC-MPPI
comment: 7 pages, 1 figure
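As a rough illustration of the MPPI machinery that BiC-MPPI builds on (not the bidirectional clustering or guide cost itself), the standard importance-weighted control update could be sketched as below; the temperature `lam` and array shapes are assumptions for the sketch.

```python
import numpy as np

def mppi_update(nominal_u, rollout_costs, noise, lam=1.0):
    """One MPPI step: softmin-weight the sampled rollouts by cost,
    then shift the nominal control sequence toward low-cost noise."""
    beta = rollout_costs.min()                 # subtract min for stability
    w = np.exp(-(rollout_costs - beta) / lam)  # low cost -> high weight
    w /= w.sum()
    # noise: (K rollouts, H horizon steps, U control dims); w: (K,)
    return nominal_u + (w[:, None, None] * noise).sum(axis=0)
```

BiC-MPPI applies this kind of update over both forward rollouts (from the initial state) and backward rollouts (from the terminal state), then connects them.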
☆ Overcoming Autoware-Ubuntu Incompatibility in Autonomous Driving Systems-Equipped Vehicles: Lessons Learned
Autonomous vehicles have been rapidly developed in response to demand for
safety and efficiency in transportation systems. Because autonomous vehicles
are designed on open-source operating and computing systems, numerous resources
exist for building an operating platform composed of Ubuntu, Autoware, and the
Robot Operating System (ROS). However, no explicit guidelines exist to help
scholars troubleshoot incompatibilities between the Autoware platform and the
Ubuntu operating systems installed in autonomous driving systems-equipped
vehicles (e.g., the Chrysler Pacifica). The paper presents
an overview of integrating the Autoware platform into the autonomous vehicle's
interface based on lessons learned from trouble-shooting processes for
resolving incompatible issues. The trouble-shooting processes are presented
based on resolving the incompatibility and integration issues of Ubuntu 20.04,
Autoware.AI, and ROS Noetic software installed in an autonomous driving
systems-equipped vehicle. Specifically, the paper focuses on common
incompatibility issues and code-solving protocols involving Python
compatibility, Compute Unified Device Architecture (CUDA) installation,
Autoware installation, and simulation in Autoware.AI. The objective of the
paper is to provide an explicit, detail-oriented presentation of how to address
incompatibility issues in an autonomous vehicle's operating interface. The
lessons and experience presented in the paper will be useful for researchers
who encounter similar issues and can follow up by performing troubleshooting
activities and implementing ADS-related projects on the Ubuntu, Autoware, and
ROS operating systems.
☆ Grounding Robot Policies with Visuomotor Language Guidance
Recent advances in the fields of natural language processing and computer
vision have shown great potential in understanding the underlying dynamics of
the world from large-scale internet data. However, translating this knowledge
into robotic systems remains an open challenge, given the scarcity of
human-robot interactions and the lack of large-scale datasets of real-world
robotic data. Previous robot learning approaches such as behavior cloning and
reinforcement learning have shown great capabilities in learning robotic skills
from human demonstrations or from scratch in specific environments. However,
these approaches often require task-specific demonstrations or designing
complex simulation environments, which limits the development of generalizable
and robust policies for new settings. Aiming to address these limitations, we
propose an agent-based framework for grounding robot policies to the current
context, considering the constraints of a current robot and its environment
using visuomotor-grounded language guidance. The proposed framework is composed
of a set of conversational agents designed for specific roles -- namely,
high-level advisor, visual grounding, monitoring, and robotic agents. Given a
base policy, the agents collectively generate guidance at run time to shift the
action distribution of the base policy towards more desirable future states. We
demonstrate that our approach can effectively guide manipulation policies to
achieve significantly higher success rates both in simulation and in real-world
experiments without the need for additional human demonstrations or extensive
exploration. Project videos at https://sites.google.com/view/motorcortex/home.
comment: 19 pages, 6 figures, 1 table
☆ Enabling Novel Mission Operations and Interactions with ROSA: The Robot Operating System Agent
Rob Royce, Marcel Kaufmann, Jonathan Becktor, Sangwoo Moon, Kalind Carpenter, Kai Pak, Amanda Towler, Rohan Thakker, Shehryar Khattak
The advancement of robotic systems has revolutionized numerous industries,
yet their operation often demands specialized technical knowledge, limiting
accessibility for non-expert users. This paper introduces ROSA (Robot Operating
System Agent), an AI-powered agent that bridges the gap between the Robot
Operating System (ROS) and natural language interfaces. By leveraging
state-of-the-art language models and integrating open-source frameworks, ROSA
enables operators to interact with robots using natural language, translating
commands into actions and interfacing with ROS through well-defined tools.
ROSA's design is modular and extensible, offering seamless integration with
both ROS1 and ROS2, along with safety mechanisms like parameter validation and
constraint enforcement to ensure secure, reliable operations. While ROSA was
originally designed for ROS, it can be extended to other robotics middlewares
to maximize compatibility across missions. ROSA enhances
human-robot interaction by democratizing access to complex robotic systems,
empowering users of all expertise levels with multi-modal capabilities such as
speech integration and visual perception. Ethical considerations are thoroughly
addressed, guided by foundational principles like Asimov's Three Laws of
Robotics, ensuring that AI integration promotes safety, transparency, privacy,
and accountability. By making robotic technology more user-friendly and
accessible, ROSA not only improves operational efficiency but also sets a new
standard for responsible AI use in robotics and potentially future mission
operations. This paper introduces ROSA's architecture and showcases initial
mock-up operations in JPL's Mars Yard, a laboratory, and a simulation using
three different robots. The core ROSA library is available as open-source.
comment: Under review for IEEE Aerospace Conference, 20 pages, 20 figures
☆ LocoVR: Multiuser Indoor Locomotion Dataset in Virtual Reality
Understanding human locomotion is crucial for AI agents such as robots,
particularly in complex indoor home environments. Modeling human trajectories
in these spaces requires insight into how individuals maneuver around physical
obstacles and manage social navigation dynamics. These dynamics include subtle
behaviors influenced by proxemics (the social use of space), such as stepping
aside to allow others to pass or choosing longer routes to avoid collisions.
Previous research has developed datasets of human motion in indoor scenes, but
these are often limited in scale and lack the nuanced social navigation
dynamics common in home environments. To address this, we present LocoVR, a
dataset of 7000+ two-person trajectories captured in virtual reality from over
130 different indoor home environments. LocoVR provides full body pose data and
precise spatial information, along with rich examples of socially-motivated
movement behaviors. For example, the dataset captures instances of individuals
navigating around each other in narrow spaces, adjusting paths to respect
personal boundaries in living areas, and coordinating movements in high-traffic
zones like entryways and kitchens. Our evaluation shows that LocoVR
significantly enhances model performance in three practical indoor tasks that
utilize human trajectories, and enables prediction of socially aware
navigation patterns in home environments.
♻ ☆ TURTLMap: Real-time Localization and Dense Mapping of Low-texture Underwater Environments with a Low-cost Unmanned Underwater Vehicle IROS 2024
Significant work has been done on advancing localization and mapping in
underwater environments. Still, state-of-the-art methods are challenged by
low-texture environments, which are common in underwater settings. This makes
it difficult to use existing methods in diverse, real-world scenes. In this
paper, we present TURTLMap, a novel solution that focuses on textureless
underwater environments through a real-time localization and mapping method. We
show that this method is low-cost, and capable of tracking the robot
accurately, while constructing a dense map of a low-textured environment in
real-time. We evaluate the proposed method using real-world data collected in
an indoor water tank with a motion capture system and ground truth reference
map. Qualitative and quantitative results validate the proposed system achieves
accurate and robust localization and precise dense mapping, even when subject
to wave conditions. The project page for TURTLMap is
https://umfieldrobotics.github.io/TURTLMap.
comment: Accepted to IROS 2024
♻ ☆ The Brain-Inspired Cooperative Shared Control Framework for Brain-Machine Interface
In brain-machine interface (BMI) applications, a key challenge is the low
information content and high noise level in neural signals, severely affecting
stable robotic control. To address this challenge, we propose a cooperative
shared control framework based on brain-inspired intelligence, where control
signals are decoded from neural activity, and the robot handles the fine
control. This allows for a combination of flexible and adaptive interaction
control between the robot and the brain, making intricate human-robot
collaboration feasible.
The proposed framework utilizes spiking neural networks (SNNs) for
controlling a robotic arm and wheels, including speed and steering. While full
integration of the system remains a future goal, individual modules for robotic
arm control, object tracking, and map generation have been successfully
implemented. The framework is expected to significantly enhance the performance
of BMI. In practical settings, the BMI with cooperative shared control,
utilizing a brain-inspired algorithm, will greatly enhance the potential for
clinical applications.
comment: This article needs to be updated with the corrected figure and content
♻ ☆ A Unified Generative Framework for Realistic Lidar Simulation in Autonomous Driving Systems
Simulation models for perception sensors are integral components of
automotive simulators used for the virtual Verification and Validation (V&V)
of Autonomous Driving Systems (ADS). These models also serve as powerful tools
for generating synthetic datasets to train deep learning-based perception
models. Lidar is a widely used sensor type among the perception sensors for ADS
due to its high precision in 3D environment scanning. However, developing
realistic Lidar simulation models is a significant technical challenge. In
particular, unrealistic models can result in a large gap between the
synthesised and real-world point clouds, limiting their effectiveness in ADS
applications. Recently, deep generative models have emerged as promising
solutions to synthesise realistic sensory data. However, for Lidar simulation,
deep generative models have been primarily hybridised with conventional
algorithms, leaving unified generative approaches largely unexplored in the
literature. Motivated by this research gap, we propose a unified generative
framework to enhance Lidar simulation fidelity. Our proposed framework projects
Lidar point clouds into depth-reflectance images via a lossless transformation,
and employs our novel Controllable Lidar point cloud Generative model, CoLiGen,
to translate the images. We extensively evaluate our CoLiGen model, comparing
it with the state-of-the-art image-to-image translation models using various
metrics to assess the realness, faithfulness, and performance of a downstream
perception model. Our results show that CoLiGen exhibits superior performance
across most metrics. The dataset and source code for this research are
available at https://github.com/hamedhaghighi/CoLiGen.git.
♻ ☆ Exploring Human's Gender Perception and Bias toward Non-Humanoid Robots
In this study, we investigate the human perception of gender and bias toward
non-humanoid robots. As robots increasingly integrate into various sectors
beyond industry, it is essential to understand how humans engage with
non-humanoid robotic forms. This research focuses on the role of
anthropomorphic cues, including gender signals, in influencing human-robot
interaction and user acceptance of non-humanoid robots. Through three surveys,
we analyze how design elements such as physical appearance, voice modulation,
and behavioral attributes affect gender perception and task suitability. Our
findings demonstrate that even non-humanoid robots like Spot, Mini-Cheetah, and
drones are subject to gender attribution based on anthropomorphic features,
affecting their perceived roles and operational trustworthiness. The results
underscore the importance of balancing design elements to optimize both
functional efficiency and user relatability, particularly in critical contexts.
♻ ☆ Long-horizon Locomotion and Manipulation on a Quadrupedal Robot with Large Language Models
We present a large language model (LLM) based system to empower quadrupedal
robots with problem-solving abilities for long-horizon tasks beyond short-term
motions. Long-horizon tasks for quadrupeds are challenging since they require
both a high-level understanding of the semantics of the problem for task
planning and a broad range of locomotion and manipulation skills to interact
with the environment. Our system builds a high-level reasoning layer with large
language models, which generates hybrid discrete-continuous plans as robot code
from task descriptions. It comprises multiple LLM agents: a semantic planner
for sketching a plan, a parameter calculator for predicting arguments in the
plan, and a code generator to convert the plan into executable robot code. At
the low level, we adopt reinforcement learning to train a set of motion
planning and control skills to unleash the flexibility of quadrupeds for rich
environment interactions. Our system is tested on long-horizon tasks that are
infeasible to complete with one single skill. Simulation and real-world
experiments show that it successfully figures out multi-step strategies and
demonstrates non-trivial behaviors, including building tools or notifying a
human for help. Demos are available on our project page:
https://sites.google.com/view/long-horizon-robot.
♻ ☆ HGS-Planner: Hierarchical Planning Framework for Active Scene Reconstruction Using 3D Gaussian Splatting
In complex missions such as search and rescue, robots must make intelligent
decisions in unknown environments, relying on their ability to perceive and
understand their surroundings. High-quality and real-time reconstruction
enhances situational awareness and is crucial for intelligent robotics.
Traditional methods often struggle with poor scene representation or are too
slow for real-time use. Inspired by the efficacy of 3D Gaussian Splatting
(3DGS), we propose a hierarchical planning framework for fast and high-fidelity
active reconstruction. Our method evaluates completion and quality gain to
adaptively guide reconstruction, integrating global and local planning for
efficiency. Experiments in simulated and real-world environments show our
approach outperforms existing real-time methods.
♻ ☆ Gaitor: Learning a Unified Representation Across Gaits for Real-World Quadruped Locomotion
The current state-of-the-art in quadruped locomotion is able to produce a
variety of complex motions. These methods either rely on switching between a
discrete set of skills or learn a distribution across gaits using complex
black-box models. Alternatively, we present Gaitor, which learns a disentangled
and 2D representation across locomotion gaits. This learnt representation forms
a planning space for closed-loop control delivering continuous gait transitions
and perceptive terrain traversal. Gaitor's latent space is readily
interpretable and we discover that during gait transitions, novel unseen gaits
emerge. The latent space is disentangled with respect to footswing heights and
lengths. This means that these gait characteristics can be varied independently
in the 2D latent representation. Together with a simple terrain encoding and a
learnt planner operating in the latent space, Gaitor can take motion commands
including desired gait type and swing characteristics all while reacting to
uneven terrain. We evaluate Gaitor in both simulation and the real world on the
ANYmal C platform. To the best of our knowledge, this is the first work
learning a unified and interpretable latent space for multiple gaits, resulting
in continuous blending between different locomotion modes on a real quadruped
robot. An overview of the methods and results in this paper is found at
https://youtu.be/eVFQbRyilCA.
comment: 14 pages, 8 figures, 2 tables, Accepted to CoRL 2024
♻ ☆ Hi-SLAM: Scaling-up Semantics in SLAM with a Hierarchically Categorical Gaussian Splatting
We propose Hi-SLAM, a semantic 3D Gaussian Splatting SLAM method featuring a
novel hierarchical categorical representation, which enables accurate global 3D
semantic mapping, scaling-up capability, and explicit semantic label prediction
in the 3D world. The parameter usage in semantic SLAM systems increases
significantly with the growing complexity of the environment, making it
particularly challenging and costly for scene understanding. To address this
problem, we introduce a novel hierarchical representation that encodes semantic
information in a compact form into 3D Gaussian Splatting, leveraging the
capabilities of large language models (LLMs). We further introduce a novel
semantic loss designed to optimize hierarchical semantic information through
both inter-level and cross-level optimization. Furthermore, we enhance the
whole SLAM system, resulting in improved tracking and mapping performance. Our
Hi-SLAM outperforms existing dense SLAM methods in both mapping and tracking
accuracy, while achieving a 2x operation speed-up. Additionally, it exhibits
competitive performance in rendering semantic segmentation in small synthetic
scenes, with significantly reduced storage and training time requirements.
Rendering FPS impressively reaches 2,000 with semantic information and 3,000
without it. Most notably, it showcases the capability of handling the complex
real-world scene with more than 500 semantic classes, highlighting its valuable
scaling-up capability.
comment: 6 pages, 4 figures
♻ ☆ PointNetPGAP-SLC: A 3D LiDAR-based Place Recognition Approach with Segment-level Consistency Training for Mobile Robots in Horticulture
3D LiDAR-based place recognition remains largely underexplored in
horticultural environments, which present unique challenges due to their
semi-permeable nature to laser beams. This characteristic often results in
highly similar LiDAR scans from adjacent rows, leading to descriptor ambiguity
and, consequently, compromised retrieval performance. In this work, we address
the challenges of 3D LiDAR place recognition in horticultural environments,
particularly focusing on inter-row ambiguity by introducing three key
contributions: (i) a novel model, PointNetPGAP, which combines the outputs of
two statistically-inspired aggregators into a single descriptor; (ii) a
Segment-Level Consistency (SLC) model, used exclusively during training to
enhance descriptor robustness; and (iii) the HORTO-3DLM dataset, comprising
LiDAR sequences from orchards and strawberry fields. Experimental evaluations
conducted on the HORTO-3DLM and KITTI Odometry datasets demonstrate that
PointNetPGAP outperforms state-of-the-art models, including OverlapTransformer
and PointNetVLAD, particularly when the SLC model is applied. These results
underscore the model's superiority, especially in horticultural environments,
by significantly improving retrieval performance in segments with higher
ambiguity.
comment: This preprint has been accepted for publication in IEEE Robotics and
Automation Letters, 2024
♻ ☆ Two is Better Than One: Digital Siblings to Improve Autonomous Driving Testing
Simulation-based testing represents an important step to ensure the
reliability of autonomous driving software. In practice, when companies rely on
third-party general-purpose simulators, either for in-house or outsourced
testing, the generalizability of testing results to real autonomous vehicles is
at stake. In this paper, we enhance simulation-based testing by introducing the
notion of digital siblings, a multi-simulator approach that tests a given
autonomous vehicle on multiple general-purpose simulators built with different
technologies, which operate collectively as an ensemble in the testing process.
We exemplify our approach on a case study focused on testing the lane-keeping
component of an autonomous vehicle. We use two open-source simulators as
digital siblings, and we empirically compare such a multi-simulator approach
against a digital twin of a physical scaled autonomous vehicle on a large set
of test cases. Our approach requires generating and running test cases for each
individual simulator, in the form of sequences of road points. Then, test cases
are migrated between simulators, using feature maps to characterize the
exercised driving conditions. Finally, the joint predicted failure probability
is computed, and a failure is reported only in cases of agreement among the
siblings.
Our empirical evaluation shows that the ensemble failure predictor by the
digital siblings is superior to each individual simulator at predicting the
failures of the digital twin. We discuss the findings of our case study and
detail how our approach can help researchers interested in automated testing of
autonomous driving software.
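A minimal sketch of the agreement rule described above, assuming each sibling simulator outputs a failure probability for a test case; the threshold value and the independence assumption behind the joint probability are mine, not the paper's.

```python
def siblings_verdict(p_fail_a, p_fail_b, threshold=0.5):
    """Combine two simulators' failure predictions for one test case.

    Returns the joint failure probability (under an assumed independence
    of the siblings' predictions) and a flag that is True only when both
    siblings agree that the test case fails."""
    joint = p_fail_a * p_fail_b
    agree_fail = (p_fail_a >= threshold) and (p_fail_b >= threshold)
    return joint, agree_fail
```

Requiring agreement trades some recall for precision: a failure flagged by only one simulator is treated as a possible simulator-specific artifact rather than a true failure.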
♻ ☆ ScissorBot: Learning Generalizable Scissor Skill for Paper Cutting via Simulation, Imitation, and Sim2Real
This paper tackles the challenging robotic task of generalizable paper
cutting using scissors. In this task, scissors attached to a robot arm are
driven to accurately cut curves drawn on the paper, which is hung with the top
edge fixed. Due to the frequent paper-scissor contact and consequent fracture,
the paper undergoes continual deformation and changing topology, which is
difficult to model accurately. To ensure effective execution, we customize an
action primitive sequence for imitation learning to constrain its action space,
thus alleviating potential compounding errors. Finally, by integrating
sim-to-real techniques to bridge the gap between simulation and reality, our
policy can be effectively deployed on the real robot. Experimental results
demonstrate that our method surpasses all baselines in both simulation and
real-world benchmarks and achieves performance comparable to human operation
with a single hand under the same conditions.
comment: Accepted by CoRL2024
♻ ☆ HBTP: Heuristic Behavior Tree Planning with Large Language Model Reasoning
Behavior Trees (BTs) are increasingly becoming a popular control structure in
robotics due to their modularity, reactivity, and robustness. In terms of BT
generation methods, BT planning shows promise for generating reliable BTs.
However, the scalability of BT planning is often constrained by prolonged
planning times in complex scenarios, largely due to a lack of domain knowledge.
In contrast, pre-trained Large Language Models (LLMs) have demonstrated task
reasoning capabilities across various domains, though the correctness and
safety of their planning remain uncertain. This paper proposes integrating BT
planning with LLM reasoning, introducing Heuristic Behavior Tree Planning
(HBTP), a reliable and efficient framework for BT generation. The key idea in
HBTP is to leverage LLMs for task-specific reasoning to generate a heuristic
path, which BT planning can then follow to expand efficiently. We first
introduce the heuristic BT expansion process, along with two heuristic variants
designed for optimal planning and satisficing planning, respectively. Then, we
propose methods to address the inaccuracies of LLM reasoning, including action
space pruning and reflective feedback, to further enhance both reasoning
accuracy and planning efficiency. Experiments demonstrate the theoretical
bounds of HBTP, and results from four datasets confirm its practical
effectiveness in everyday service robot applications.
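Reduced to its ordering effect, the heuristic expansion idea could look like the sketch below: candidate actions that appear on the LLM-suggested heuristic path are expanded first. The function name and data shapes are illustrative, not taken from the paper.

```python
def heuristic_order(candidates, heuristic_path):
    """Order candidate actions for BT expansion: actions on the
    LLM-generated heuristic path come first, in path order; actions
    absent from the path keep their original relative order at the end."""
    rank = {action: i for i, action in enumerate(heuristic_path)}
    return sorted(candidates, key=lambda a: rank.get(a, len(heuristic_path)))
```

Because off-path actions are still expanded eventually, planner completeness is preserved even when the LLM's heuristic path is partially wrong.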
♻ ☆ Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training NeurIPS 2024
Learning a generalist embodied agent capable of completing multiple tasks
poses challenges, primarily stemming from the scarcity of action-labeled
robotic datasets. In contrast, a vast amount of human videos exist, capturing
intricate tasks and interactions with the physical world. Promising prospects
arise for utilizing actionless human videos for pre-training and transferring
the knowledge to facilitate robot policy learning through limited robot
demonstrations. However, it remains a challenge due to the domain gap between
humans and robots. Moreover, it is difficult to extract useful information
representing the dynamic world from human videos, because of their noisy and
multimodal data structure. In this paper, we introduce a novel framework to
tackle these challenges, which leverages a unified discrete diffusion to
combine generative pre-training on human videos and policy fine-tuning on a
small number of action-labeled robot videos. We start by compressing both human
and robot videos into unified video tokens. In the pre-training stage, we
employ a discrete diffusion model with a mask-and-replace diffusion strategy to
predict future video tokens in the latent space. In the fine-tuning stage, we
harness the imagined future videos to guide low-level action learning with a
limited set of robot data. Experiments demonstrate that our method generates
high-fidelity future videos for planning and that the fine-tuned policies
outperform previous state-of-the-art approaches. Our
project website is available at https://video-diff.github.io/.
comment: Accepted by NeurIPS 2024. 24 pages
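The mask-and-replace corruption used in the pre-training stage can be illustrated roughly as below; the linear noise schedule, the 90/10 mask-versus-random split, and the token shapes are assumptions for the sketch, not the paper's exact schedule.

```python
import numpy as np

def mask_and_replace(tokens, t, T, vocab_size, mask_id, rng):
    """Forward corruption for discrete diffusion on video tokens:
    at step t of T, each token is corrupted with probability t/T,
    becoming [MASK] most of the time or a random vocabulary token."""
    p_corrupt = t / T
    out = tokens.copy()
    corrupt = rng.random(tokens.shape) < p_corrupt
    use_mask = rng.random(tokens.shape) < 0.9     # mostly mask, rarely random
    out[corrupt & use_mask] = mask_id
    rand_tokens = rng.integers(0, vocab_size, tokens.shape)
    replace = corrupt & ~use_mask
    out[replace] = rand_tokens[replace]
    return out
```

A model trained to invert this corruption can then predict future video tokens in latent space, which is the generative pre-training objective the abstract describes.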